Model-Free Preference-Based Reinforcement Learning

Authors

  • Christian Wirth
  • Johannes Fürnkranz
  • Gerhard Neumann

Abstract

Specifying a numeric reward function for reinforcement learning typically requires a lot of hand-tuning from a human expert. In contrast, preference-based reinforcement learning (PBRL) utilizes only pairwise comparisons between trajectories as a feedback signal, which are often more intuitive to specify. Currently available approaches to PBRL for control problems with continuous state/action spaces require a known or estimated model, which is often not available and hard to learn. In this paper, we integrate preference-based estimation of the reward function into a model-free reinforcement learning (RL) algorithm, resulting in a model-free PBRL algorithm. Our new algorithm is based on Relative Entropy Policy Search (REPS), enabling us to utilize stochastic policies and to directly control the greediness of the policy update. REPS decreases the exploration of the policy only slowly by bounding the relative entropy of the policy update, which ensures that the algorithm is provided with a versatile set of trajectories, and consequently with informative preferences. The preference-based estimation is computed using a sample-based Bayesian method, which can also estimate the uncertainty of the utility. Additionally, we compare to a linear approximation based on inverse RL. We show that both approaches compare favourably to the current state of the art. The overall result is an algorithm that can learn non-parametric continuous-action policies from a small number of preferences.

Introduction

One major limitation of reinforcement learning is that a numeric reward function needs to be specified by the user. This is particularly true for complex control tasks as they occur in robotics, where the reward function often consists of several hand-tuned terms. Hence, in recent years, the community has worked on making reinforcement learning algorithms more widely applicable by avoiding a hand-coded definition of the reward function. One of these approaches is preference-based reinforcement learning (PBRL), which uses only pairwise preferences over policies, trajectories, states, or actions (Akrour, Schoenauer, and Sebag 2012; Wirth and Fürnkranz 2013b; Wilson, Fern, and Tadepalli 2012). Many PBRL approaches rely on a model of the system dynamics (Akrour et al. 2014), but accurate models are often not available and are also hard to learn. Hence, a model-free approach is desirable. Additionally, directed exploration of the utility function is often used, which is hard to perform if the model is unknown or the state-action space is continuous and possibly high-dimensional. In this paper, we present an algorithm for learning a continuous-action policy without requiring knowledge of the model or maintaining an explicit approximation of it. The preferences are used to estimate the expert's utility function. In contrast to traditional PBRL methods, our method is able to use data from interactions with the environment in an online fashion to improve the policy as well as the estimate of the utility function, and it can nevertheless achieve better performance. The utility function is learned with a Bayesian approach that also estimates its uncertainty, an estimate that could be used for exploration. Additionally, we compare with a computationally less expensive linear approximation based on ideas of Ng and Russell (2000).
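The abstract combines three technical ingredients: a utility function learned from pairwise trajectory preferences, a sample-based Bayesian posterior over that utility, and a REPS-style policy update whose greediness is limited by a relative-entropy bound. The following minimal sketch illustrates these ingredients under assumptions of our own that the abstract does not pin down: a linear utility U(τ) = wᵀφ(τ) over trajectory features, a Bradley-Terry-style likelihood for preferences, self-normalized importance sampling over a Gaussian prior as the posterior estimate, and a bisection search for the REPS temperature. All function names are illustrative; this is not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def utility(w, phi):
    # Linear utility U(tau) = w . phi(tau) -- a modelling assumption of this sketch.
    return phi @ w

def preference_log_likelihood(w, prefs, features):
    # Bradley-Terry-style model: P(tau_i preferred over tau_j) = sigmoid(U_i - U_j).
    ll = 0.0
    for i, j in prefs:  # (i, j) means trajectory i was preferred over trajectory j
        diff = utility(w, features[i]) - utility(w, features[j])
        ll += -np.logaddexp(0.0, -diff)  # log sigmoid(diff), numerically stable
    return ll

def posterior_samples(prefs, features, dim, n_samples=2000):
    # Sample-based Bayesian estimate: weight N(0, I) prior samples by the
    # preference likelihood (self-normalized importance sampling).
    ws = rng.standard_normal((n_samples, dim))
    log_l = np.array([preference_log_likelihood(w, prefs, features) for w in ws])
    weights = np.exp(log_l - log_l.max())
    return ws, weights / weights.sum()

def reps_weights(returns, epsilon=0.5):
    # REPS-flavoured reweighting: choose a temperature eta so that the KL
    # divergence of the reweighted sample distribution from the uniform one
    # stays below epsilon. Geometric bisection stands in for the REPS dual.
    lo, hi = 1e-3, 1e3
    p = np.ones_like(returns) / len(returns)
    for _ in range(60):
        eta = np.sqrt(lo * hi)
        p = np.exp((returns - returns.max()) / eta)
        p /= p.sum()
        kl = float(np.sum(p * np.log(p * len(p) + 1e-12)))
        if kl > epsilon:
            lo = eta  # update too greedy -> raise the temperature
        else:
            hi = eta
    return p

# Toy demo: 20 trajectories with 3-d features; the simulated expert prefers phi[0].
features = rng.standard_normal((20, 3))
true_w = np.array([1.0, 0.0, 0.0])
all_pairs = [(i, j) for i in range(20) for j in range(20)
             if utility(true_w, features[i]) > utility(true_w, features[j])]
prefs = [all_pairs[k] for k in rng.choice(len(all_pairs), size=15, replace=False)]

ws, pw = posterior_samples(prefs, features, dim=3)
w_mean = pw @ ws                          # posterior mean utility weights
w_std = np.sqrt(pw @ (ws - w_mean) ** 2)  # per-dimension posterior uncertainty

p = reps_weights(utility(w_mean, features), epsilon=0.5)
print("estimated utility weights:", w_mean, "+/-", w_std)
print("effective sample size after reweighting:", 1.0 / np.sum(p ** 2))

The epsilon bound plays the role the abstract describes: a small bound keeps the reweighted trajectory distribution close to the old one, so exploration decreases only slowly and the next batch of trajectory pairs stays diverse enough to yield informative preferences. The posterior standard deviation is the uncertainty estimate that the paper suggests could drive exploration.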

Similar articles

Reinforcement learning based feedback control of tumor growth by limiting maximum chemo-drug dose using fuzzy logic

In this paper, a model-free reinforcement learning-based controller is designed to extract a treatment protocol because the design of a model-based controller is complex due to the highly nonlinear dynamics of cancer. The Q-learning algorithm is used to develop an optimal controller for cancer chemotherapy drug dosing. In the Q-learning algorithm, each entry of the Q-table is updated using data...
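The excerpt's phrase "each entry of the Q-table is updated using data" refers to the standard tabular Q-learning update. As a reminder of what that update looks like, here is a generic sketch; the toy transition function, state/action counts, and hyperparameters are placeholders of our own, not the cited paper's tumor-growth model:

import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 3
alpha, gamma, eps = 0.1, 0.95, 0.1   # learning rate, discount, epsilon-greedy rate
Q = np.zeros((n_states, n_actions))  # the Q-table, one entry per state-action pair

def step(state, action):
    # Placeholder dynamics and reward; a real application would query the
    # actual environment (e.g., a simulated patient model).
    next_state = (state + action) % n_states
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

state = 0
for _ in range(10000):
    if rng.random() < eps:
        action = int(rng.integers(n_actions))  # explore
    else:
        action = int(np.argmax(Q[state]))      # exploit
    next_state, reward = step(state, action)
    # The Q-learning update for a single table entry:
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(np.round(Q, 2))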

Operation Scheduling of MGs Based on Deep Reinforcement Learning Algorithm

In this paper, the operation scheduling of Microgrids (MGs), including Distributed Energy Resources (DERs) and Energy Storage Systems (ESSs), is proposed using a Deep Reinforcement Learning (DRL) based approach. Due to the dynamic characteristics of the problem, it is first formulated as a Markov Decision Process (MDP). Next, a Deep Deterministic Policy Gradient (DDPG) algorithm is presented t...

EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning

Reinforcement learning algorithms are usually hard for non-expert users to use. Several aspects must be considered, such as the definition of the state-, action- and reward-space, as well as the algorithm's hyperparameters. Preference-based approaches try to address these problems by omitting the requirement for exact rewards, replacing them with preferences over solutions. Some algorithms have bee...

Preference-Based Reinforcement Learning: A preliminary survey

Preference-based reinforcement learning has gained significant popularity over the years, but it is still unclear what exactly preference learning is and how it relates to other reinforcement learning tasks. In this paper, we present a general definition of preferences as well as some insight into how these approaches compare to reinforcement learning, inverse reinforcement learning, and other relate...

Using Grouped Linear Prediction and Accelerated Reinforcement Learning for Online Content Caching

Proactive caching is an effective way to alleviate peak-hour traffic congestion by prefetching popular contents at the wireless network edge. Maximizing caching efficiency requires knowledge of the content popularity profile, which, however, is often unavailable in advance. In this paper, we first propose a new linear prediction model, named the grouped linear model (GLM), to estimate the future ...

Publication year: 2016